Abstract

Learning-assisted automated reasoning has recently gained
popularity among the users of Isabelle/HOL, HOL Light, and
Mizar. In this paper, we present an add-on to the HOL4 proof
assistant and an adaptation of the HOL(y)Hammer system
that provides machine learning-based premise selection and
automated reasoning for HOL4 as well. We efficiently record
the HOL4 dependencies and extract features from the theorem
statements, which form a basis for premise selection.
HOL(y)Hammer transforms the HOL4 statements into the
various TPTP-ATP proof formats, which are then processed by
the ATPs.

We discuss the different evaluation settings: ATPs,
accessible lemmas, and premise numbers. We measure the
performance of HOL(y)Hammer on the HOL4 standard library. The
results are combined accordingly and compared with the
HOL Light experiments, showing a comparably high quality
of predictions. The system directly benefits HOL4 users
by automatically finding proof dependencies that can be
reconstructed by Metis.

Categories and Subject Descriptors I.2.3 [Artificial
Intelligence]: Inference engines

Keywords HOL4; higher-order logic; automated reasoning;
premise selection

1. Introduction 

The HOL4 proof assistant provides its users with a full
ML programming environment in the LCF tradition. Its simple
logical kernel and interactive interface allow safe and
fast developments, while the built-in decision procedures
can automatically establish many simple theorems, leaving
only the harder goals to its users. However, manually proving
theorems based on its simple rules is a tedious task.
Therefore, general purpose automation has been developed
internally, based on model elimination (MESON), tableau
(blast) or resolution (Metis). Although essential to
HOL4 developers, these methods are so far not able to compete
with the external ATPs, which are optimized for fast proof
search with many axioms present, continuously evaluated
on the TPTP library, and updated with the most
successful techniques. The TPTP (Thousands of Problems
for Theorem Provers) is a library of test problems for
automated theorem proving (ATP) systems. This standard
enables convenient communication between different systems
and researchers.

On the other hand, the HOL4 system provides a functionality
to search the database for theorems that match a
user-chosen pattern. The search is semi-automatic and the
resulting lemmas are not necessarily helpful in proving the
conjecture. An approach that combines the two, searching
for relevant theorems and using automated reasoning methods
to (pseudo-)minimize the set of premises necessary to
solve the goal, forms the basis of "hammer" systems such as
Sledgehammer for Isabelle/HOL, HOL(y)Hammer for
HOL Light or MizAR for Mizar. Furthermore, apart from the
syntactic similarity of a goal to known facts, the relevance of
a fact can be learned by analyzing dependencies in previous
proofs using machine learning techniques, which leads
to a significant increase in the power of such systems.

In this paper, we adapt the HOL(y)Hammer system to the
HOL4 system and test its performance on the HOL4 standard
library. The libraries of HOL4 and HOL Light are exported
together with proof dependencies and theorem statement
features; the predictors learn from the dependencies and
the features to be able to produce lemmas relevant to a
conjecture. Each problem is translated to the TPTP FOF
format. When an ATP finds a proof, the necessary premises
are extracted. They are read back to HOL4 as proof advice
and given to Metis for reconstruction.

An adapted version of the resulting software is made
available to the users of HOL4 in an interactive session, and
can be used in newly developed theories. Given a conjecture,
the SML function computes every step of the interaction loop
and, if successful, returns the conjecture as a theorem:

Example 1. (HOL(y)Hammer interactive call)

load "holyHammer";
val it = (): unit
holyhammer ``1+1=2``;
Relevant theorems: ALT_ZERO ONE TWO ADD1
metis: r[+0+6]#
val it = |- 1 + 1 = 2: thm


The HOL4 prover already benefits from export to SMT
solvers such as Yices, Z3 and Beagle. These methods
perform best when solving problems from the supported
theories of the SMT solver. By comparison, HOL(y)Hammer
is a general purpose tool, as it relies on ATPs without theory
reasoning, and it can provide easily¹ re-provable problems to
Metis.

The HOL4 standard distribution has long been
equipped with proof recording kernels. We first
considered adapting these kernels for our aim. But as machine
learning only needs the proof dependencies and the approach
based on full proof recording is not efficient, we perform
minimal modifications to the original kernel.

Contributions We provide learning-assisted automated
reasoning for HOL4 and evaluate its performance in
comparison to that in HOL Light. In order to do so, we:

• Export the HOL4 data

Theorems, dependencies, and features are exported by a
patched version of the HOL4 kernel. It can record
dependencies between theorems and keep track of how their
conjunctions are handled along the proof. We export the
HOL4 standard libraries (58 types, 2305 constants, 11972
theorems) with respect to a strict name-space rule so that
each object is uniquely identifiable, preserving if possible
its original name.

• Reprove 

We test the ability of a selection of external provers to 
reprove theorems from their dependencies. 

• Define accessibility relations 

We define and simulate different development environ¬ 
ments, with different sets of accessible facts to prove a 
theorem. 

• Experiment with predictors 

Given a theorem and an accessibility relation, we use
machine learning techniques to find relevant lemmas from 
the accessible sets. Next, we measure the quality of the 
predictions by running ATPs on the translated problems. 

The rest of this paper is organized as follows. In Section 2
we describe the export of the HOL4 and HOL Light data
into a common format and the recording of dependencies
in HOL4. In Section 3 we present the different parameters:
ATPs, proving environments, accessible sets, features, and
predictions. We select some of them for our experiments and
justify our choice. In Section 4 we present the results of the
HOL4 experiments, relate them to previous HOL(y)Hammer
experiments and explain how this affects the users. Finally, in
Section 5 we conclude and present an outlook on future
work.

2. Sharing HOL data between HOL4, HOL 
Light and HOL(y)Hammer 

In order to process HOL Light and HOL4 data in a uniform
way in HOL(y)Hammer, we export objects from their respective
theories, as well as dependencies between theorems, into
a common format. The export is available for any HOL4
and HOL Light development. We briefly describe the common
format used for exporting both libraries and present
in more detail our methods for efficiently recording objects
(types, constants and theorems) and precise dependencies
in HOL4. We refer to the earlier HOL(y)Hammer work for the
details on recording objects and dependencies for HOL Light
formalizations.

HOL Light and HOL4 share a common logic (higher-order
logic with implicit shallow polymorphism); however, their
implementations differ in the programming
language used (OCaml and SML respectively), the data
structures used to represent terms and theorems (higher-order
abstract syntax and de Bruijn indices respectively),
and the exact inference rules provided by the kernel. As
HOL(y)Hammer was initially implemented in OCaml as
an extension of HOL Light, we need to export all the HOL4
data and read it back into HOL(y)Hammer, replacing its type
and constant tables. The format that we chose is based on
the TPTP THF0 format [13] used by higher-order ATPs.
Since formulas contain polymorphic constants, which are not
supported by the THF0 format, we present an experimental
extension of this format where the type arguments
of polymorphic constants are given explicitly.

¹ The reconstruction rate is typically above 90%.

Example 2. (experimental template)
tt(name, role, formula)

The field name is the object's name. The field role is "ty"
if the object is a constant or a type, and "ax" if the object
is a theorem. The field formula is an experimental THF0
formula.

Example 3. (Object export from HOL4 to an experimental
format)

• Type

(list, 1) -> tt(list, ty, $t > $t).

• Constant

(HD, ``:'a list -> :'a``) ->
tt(HD, ty, ![A:$t]: (list A > A)).

(CONS, ``:'a -> :'a list -> :'a list``) ->
tt(CONS, ty, ![A:$t]: (A > list A > list A)).

• Theorem

(HD, ``∀n:int t:list[int]. HD (CONS n t) = n``) ->
tt(HD0, ax, (![n:int, t:(list int)]:
((HD int) ((CONS int) n t) = n))).

In this example, $t is the type of all basic types. 
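As an illustration only, the template of Example 2 can be rendered with a few lines of Python. The helper names export_line and poly_type are hypothetical and are not part of HOL(y)Hammer; the sketch assumes that types and formulas are already available as strings, and it only shows how explicit type quantifiers are placed in front of a polymorphic constant.

```python
def poly_type(type_vars, body):
    """Quantify a type over its type variables, each of kind $t.

    Hypothetical helper: a monomorphic type is returned unchanged.
    """
    if not type_vars:
        return body
    binders = ", ".join(f"{v}:$t" for v in type_vars)
    return f"![{binders}]: ({body})"


def export_line(name, role, formula):
    """Render one object in the tt(name, role, formula) template."""
    if role not in ("ty", "ax"):
        raise ValueError("role must be 'ty' (type/constant) or 'ax' (theorem)")
    return f"tt({name}, {role}, {formula})."


# The constant HD of Example 3, with its type variable made explicit.
hd_line = export_line("HD", "ty", poly_type(["A"], "list A > A"))
# The type constructor list has no type variables of its own.
list_line = export_line("list", "ty", poly_type([], "$t > $t"))
```

Keeping the quantifier construction separate from the line template mirrors the description above: the role only distinguishes declarations from theorems, while the formula carries the type arguments.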

All names of objects are prefixed by a namespace identifier
that identifies the prover and theory they have
been defined in. For readability, the namespace prefixes have
been omitted in all examples in this paper.

2.1 Creation of a HOL4 theory 

In HOL4, types and constants can be created and deleted
during the development of a theory. These objects are named
at the moment they are created. A theorem is an SML value
of type thm and can be derived from a set of basic rules,
which is an instance of a typed higher-order classical logic.
To distinguish between important lemmas and theorems
created by each small step, the user can name and delete
theorems (erase the name). Each named object still present
at the end of the development is saved and thus can be called
in future theories.

There are two ways in which an object can be lost in a
theory: either it is deleted or overwritten. As proof
dependencies for machine learning get more accurate when more
intermediate steps are available, we decided to record all
created objects, which results in the creation of slightly bigger
theories. As the originally saved objects can be called from
other theories, their names are preserved by our
transformation. Each lost object whose given name conflicts with the
name of a saved object of the same type is renamed.

Deleted objects The possibility of deleting an object or 
even a theory is mainly here to hide internal steps or to make 
the theory look nicer. We chose to remove this possibility 
by canceling the effects of the deleting functions. This is 
the only user-visible feature that behaves differently in our 
dependency recording kernel. 

Overwritten objects An object may be overwritten in the
development. As we prevent objects from being deleted, the
likelihood of this happening is increased. This typically
happens when a generalized version of a theorem is proved and
is given the same name as the initial theorem. In the case of
types and constants, the internal HOL4 mechanism already
renames overwritten objects. Theorems, by contrast, are really
erased. To avoid dependencies on theorems that have been
overwritten, we automatically rename the theorems that are
about to be overwritten.

2.2 Recording dependencies 

Dependencies are an essential part of machine learning for
theorem proving, as they provide the examples on which
predictors can be trained. We focus on recording dependencies
between named theorems, since they are directly accessible
to a user. The overhead of our method is small: it slows down
the application of any rule by a negligible amount.

Since the statements of 951 HOL4 theorems are conjunctions,
sometimes consisting of many toplevel conjuncts, we
have refined our method to record dependencies between the
toplevel conjuncts of named theorems.

Example 4. (Dependencies between conjunctions)

ADD_CLAUSES: 0 + m = m ∧ m + 0 = m ∧
  SUC m + n = SUC (m + n) ∧ m + SUC n = SUC (m + n)

ADD_ASSOC depends on:
  ADD_CLAUSES_c1: 0 + m = m
  ADD_CLAUSES_c3: SUC m + n = SUC (m + n)

The conjunct identifiers of a named theorem T are denoted
T_c1, ..., T_cN.

In certain theorems, a toplevel universal quantifier shares 
a number of conjuncts. We will also split the conjunctions in 
such cases recursively. This type of theorem is less frequent 
in the standard library (203 theorems). 

Example 5. (Conjunctions under quantifier)

MIN_0: ∀n. (MIN n 0 = 0) ∧ (MIN 0 n = 0)

By splitting conjunctions we expect to make the dependencies
used as training examples for machine learning more
precise in two directions. First, even if a theorem is too hard
to prove for the ATPs, some of its conjuncts might be provable.
Second, if a theorem depends on a big conjunction, it
typically depends only on some of its conjuncts. Even if the
precise conjuncts are not clear from the human proof, the
reproving methods can often minimize the used conjuncts.
Furthermore, reducing the number of conjuncts should ease
the reconstruction.
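The splitting described in this subsection can be sketched on a toy term representation. This is only an illustration under stated assumptions: terms are nested Python tuples ("and", left, right) and ("forall", variable, body), atomic formulas are strings, and the functions split_conjuncts and name_conjuncts are hypothetical names, not the functions of the patched kernel.

```python
def split_conjuncts(term):
    """Split toplevel conjunctions, distributing toplevel quantifiers.

    Assumed toy syntax: ("and", l, r), ("forall", var, body), or a string.
    """
    if isinstance(term, tuple) and term[0] == "and":
        return split_conjuncts(term[1]) + split_conjuncts(term[2])
    if isinstance(term, tuple) and term[0] == "forall":
        _, var, body = term
        # Re-attach the quantifier to every conjunct found under it.
        return [("forall", var, c) for c in split_conjuncts(body)]
    return [term]


def name_conjuncts(name, term):
    """Give each conjunct of a named theorem T the identifier T_c1 ... T_cN."""
    conjuncts = split_conjuncts(term)
    return {f"{name}_c{i}": c for i, c in enumerate(conjuncts, start=1)}


# Example 5: a conjunction under a universal quantifier.
min_0 = ("forall", "n", ("and", "MIN n 0 = 0", "MIN 0 n = 0"))
min_0_conjuncts = name_conjuncts("MIN_0", min_0)

# Example 4: a toplevel conjunction with four conjuncts.
add_clauses = ("and", "0 + m = m",
               ("and", "m + 0 = m",
                ("and", "SUC m + n = SUC (m + n)",
                 "m + SUC n = SUC (m + n)")))
add_clauses_conjuncts = name_conjuncts("ADD_CLAUSES", add_clauses)
```

On these inputs the third conjunct of ADD_CLAUSES is the statement listed as ADD_CLAUSES_c3 in Example 4, and both conjuncts of MIN_0 keep their quantifier.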

2.3 Implementation of the recording 

The HOL4 type of theorems thm includes a tag field in order
to remember which oracles and axioms were necessary to
prove a theorem. Each call to an oracle or axiom creates
a theorem with the associated tag. When applying a rule,
all oracles and axioms from the tags of the parents are
respectively merged, and given to the conclusion of the
rule. To record the dependencies, we added a third field to
the tag, which consists of a dependency identifier and its
dependencies.

Example 6. (Modified tag type) 

type tag = ((dependency_id, dependencies), 
oracles, axioms) 

type thm = (tag, hypotheses, conclusion) 

Since the name of a theorem may change when it is
overwritten, we create unmodifiable unique identifiers at the
moment a theorem is named. Each identifier consists of the
name of the current theory and the
number of previously named theorems in this theory. As
a side effect, this enables us to know the order in which
theorems are named, which is compatible by construction
with the pre-order given by the dependencies. Every variable
of type thm which is not named is given the identifier
unnamed. Only identifiers of named theorems will appear
in the dependencies.

We have implemented two versions of the dependency
recording algorithm: one that tracks the dependencies
between named theorems, and another that tracks dependencies
between their conjuncts. For the named theorems, the
dependencies are a set of identified theorems used to prove
the theorem. The recording is done by specifying how each
rule creates the tag of the conclusion from the tags of its
premises. The dependencies of the conclusion are the union
of the dependencies of the unnamed premises with its named
premises. This is achieved by a simple modification of the
Tag.merge function already applied to the tags of the
premises in each rule.
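The recording for named theorems can be mimicked in a few lines of Python. This is a minimal sketch and not the SML kernel code: the classes Thm and Theory and the function apply_rule are hypothetical, the identifier unnamed is modelled by None, and oracles and axioms are left out of the tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thm:
    ident: object     # (theory, index) once named, None while unnamed
    deps: frozenset   # identifiers of the named theorems used in the proof
    concl: str


class Theory:
    def __init__(self, name):
        self.name = name
        self.count = 0  # number of theorems named so far in this theory

    def save_thm(self, thm):
        """Name a theorem: give it a fresh, never-changing identifier."""
        ident = (self.name, self.count)
        self.count += 1
        return Thm(ident, thm.deps, thm.concl)


def passed(thm):
    """A named premise passes on itself, an unnamed one its dependencies."""
    if thm.ident is not None:
        return frozenset([thm.ident])
    return thm.deps


def apply_rule(concl, *premises):
    """Model of any inference rule: merge what the premises pass on."""
    deps = frozenset().union(*(passed(p) for p in premises))
    return Thm(None, deps, concl)


arith = Theory("arith")
a = arith.save_thm(apply_rule("A"))
b = arith.save_thm(apply_rule("B"))
step = apply_rule("A and B", a, b)      # unnamed intermediate step
c = arith.save_thm(apply_rule("C", step))
d = apply_rule("D", c)
```

In this run c records the identifiers of a and b through the unnamed step, while d records only c: the dependencies are direct, not transitive, and the identifiers follow the order in which theorems were named.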

When a theorem ⊢ A ∧ B is derived from the theorems
⊢ A and ⊢ B, the previously described algorithm would
make the dependencies of this theorem the union of the
dependencies of the two. If other theorems later refer to it,
they will get the union as their dependencies, even if only
one conjunct contributes to the proof. In this subsection we
define some heuristics that allow more precise tracking of
dependencies of the conjuncts of the theorems.

In order to record the dependencies between the conjuncts,
we do not record the conjuncts of named theorems,
but only store their dependencies in the tags. The dependencies
are represented as a tree, in which each leaf is a set of
conjunct identifiers (identifier and the conjunct's address).
Each leaf of the tree represents the respective conjunct in
the theorem tree, and each conjunct identifier represents a
conjunct of a named theorem.

Example 7. (An example of a theorem and its dependencies)

Th0 (named theorem): A ∧ B
Th1: C ∧ (D ∧ E)

with dependency tree Tree([Th0],[Th0_c2])

This encodes the fact that:

C depends on Th0.

D ∧ E depends on Th0_c2, which is B.



Dependencies are combined at each inference rule application,
and dependencies will contain only conjunct identifiers.
If not specified, a premise will pass on its identifier if
it is a named conjunct (conjunct of a named theorem) and
its dependency tree otherwise. We call such trees passed
dependencies. The idea is that a named conjunct should
transmit to its children not its dependencies
but itself. Indeed, we want to record the direct dependencies
and not the transitive ones.

For rules that do not preserve the structure of conjunctions,
we flatten the dependencies, i.e. we return a root
tree containing the set of all (conjunct) identifiers in the
passed dependencies. We additionally give special treatment
to the rules used for the top level organization of conjunctions:
CONJ, CONJUNCT1, CONJUNCT2, GEN, SPEC, and SUBST.


• CONJ: It returns a tree with two branches, consisting of
the passed dependencies of its first and second premise.

• CONJUNCT1 (CONJUNCT2): If its premise is named, then
the conjunct is given a conjunct identifier. Otherwise,
the first (second) branch of the dependency tree of its
premise becomes the dependencies of its conclusion.

• GEN and SPEC: The tags are unchanged by the application
of these rules, as they do not change the structure of
conjunctions. We have to be careful, however, when using
SPEC on named theorems, as it may create unwanted
conjunctions. These virtual conjunctions are not harmful,
as the right level of splitting is restored during the next
phase.

Example 8. (Creation of a virtual conjunction from a
named theorem)

∀x.x ⊢ ∀x.x
------------- SPEC [A ∧ B]
∀x.x ⊢ A ∧ B
------------- CONJUNCT1
∀x.x ⊢ A


• SUBST: Its premises consist of a theorem, a list of
substitution theorems of the form (A = B) and a template that
tells where each substitution should be applied. When
SUBST preserves the structure of conjuncts, the set of all
identifiers in the passed dependencies of the substitution
theorems is distributed over each leaf of the tree given
by the passed dependencies of the substituted theorem.
When this is not the case, the dependencies are flattened.
Since the substitution of sub-terms below the top
formula level does not affect the structure of conjunctions,
it is sufficient (although not necessary) to check
that no variable in the template is a predicate (is a
boolean or returns a boolean).

The heuristics presented above try to preserve the dependencies
associated with single conjuncts whenever possible.
It is of course possible to find more advanced heuristics that
would give more precise human-proof dependencies. However,
performing more advanced operations (even pattern
matching) may slow down the proof system too much, so we
decided to restrict ourselves to the above heuristics.
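The treatment of CONJ, CONJUNCT1, CONJUNCT2 and of structure-destroying rules can be sketched as follows. The representation is an assumption made for illustration: a dependency tree is either a frozenset leaf of conjunct identifiers or a pair (left, right), a conjunct identifier is a pair (theorem name, address), and an address is a tuple of 0/1 choices, so that ("Th0", (1,)) plays the role of Th0_c2. GEN, SPEC and SUBST are not modelled, and the names Thm, conj, conjunct and other_rule are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thm:
    name: Optional[str]  # named theorem this is a conjunct of, else None
    path: tuple          # address of the conjunct inside that theorem
    tree: object         # frozenset leaf or (left, right) pair


def flatten(tree):
    """Collect all identifiers of a dependency tree into one leaf."""
    if isinstance(tree, frozenset):
        return tree
    return flatten(tree[0]) | flatten(tree[1])


def passed(thm):
    """Passed dependencies: the identifier if named, else the tree."""
    if thm.name is not None:
        return frozenset([(thm.name, thm.path)])
    return thm.tree


def conj(a, b):
    """CONJ: two branches holding the passed dependencies."""
    return Thm(None, (), (passed(a), passed(b)))


def conjunct(thm, i):
    """CONJUNCT1 (i = 0) or CONJUNCT2 (i = 1)."""
    if thm.name is not None:
        # Named premise: the result is a named conjunct; its own tree is
        # never consulted because passed() returns its identifier.
        return Thm(thm.name, thm.path + (i,), frozenset())
    if isinstance(thm.tree, frozenset):
        return Thm(None, (), thm.tree)  # structure already flattened
    return Thm(None, (), thm.tree[i])


def other_rule(*premises):
    """Any rule that does not preserve conjunctions: flatten everything."""
    deps = frozenset()
    for p in premises:
        deps |= flatten(passed(p))
    return Thm(None, (), deps)


# Rebuilding the situation of Example 7.
th0 = Thm("Th0", (), frozenset())
c = other_rule(th0)                   # C proved from Th0 as a whole
d_and_e = other_rule(conjunct(th0, 1))  # D ∧ E proved from B only
th1 = conj(c, d_and_e)
```

Here th1.tree has one leaf for C containing Th0 and one leaf for D ∧ E containing the second conjunct of Th0, which corresponds to Tree([Th0],[Th0_c2]).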

Before exporting the theorems, we split them by recursively
distributing quantifiers and splitting conjunctions.
This gives rise to conflicting degrees of splitting: for
instance, a theorem with many conjunctions may have been
used as a whole during a proof. Given a theorem and its
dependency tree, each of its conjuncts is given the set of
identifiers of its closest parent in this tree. Then, each of
these identifiers is also split maximally. In case of a virtual
conjunction (see the SPEC rule above), the corresponding
node does not exist in the theorem tree, so we take the
conjunct corresponding to its closest parent. Finally, for each
conjunct, we obtain a set of dependencies by taking the
union of the split identifiers.

Prover     Version   Premises
Vampire    2.6       96
E-prover   1.8       128
z3         4.32      32
CVC4       1.3       128
Spass      3.5       32
iProver    1.0       128
Metis      2.3       32

Table 1. ATP provers, their versions and arguments

Example 9. (Recovering dependencies from the named
theorem Th1)

Th0 (named theorem): A ∧ B
Th1 (named theorem): C ∧ (D ∧ E)

with dependency tree Tree([Th0],[Th0_c1])

Recovering dependencies of each conjunct
Th1_c0: Th0
Th1_c1: Th0_c1
Th1_c2: Th0_c1

Splitting the dependencies
Th1_c0: Th0_c1 Th0_c2
Th1_c1: Th0_c1
Th1_c2: Th0_c1

3. Evaluation 

In this section we describe the setting used in the
experiments: the ATPs, the transformation from HOL to the
formats of the ATPs, the dependencies accessible in the
different experiments, and the features used for machine learning.

3.1 ATPs and problem transformation 

HOL(y)Hammer supports the translation to the formats of
various TPTP ATPs: FOF, TFF1, THF0, and two experimental
TPTP extensions. In this paper we restrict ourselves
to first-order monomorphic logic, as these ATPs
have been the most powerful so far and integrating them
in HOL4 already poses an interesting challenge. The
transformation that HOL(y)Hammer uses is heavily influenced by
previous work by Paulson and Harrison. It is described
in detail in the earlier HOL(y)Hammer work; here we recall
only the crucial points. Abstractions are removed by
β-reduction followed by λ-lifting, predicates as arguments
are removed by introducing existentially quantified
variables, and the apply functor
is used to reduce all applications to first-order. By default
HOL(y)Hammer uses the tagged polymorphic encoding: a
special tag taking two arguments is introduced, and applied
to all variable instances and certain applications. The first
argument is the first-order flattened representation of the
type, with variables functioning as type variables, and the
second argument is the value itself.

The initially used provers, their versions and default numbers
of premises are presented in Table 1. The HOL Light
experiments showed that different provers perform best
with different numbers of given premises. This is particularly
visible for the ATP provers that already include the
relevance filter SInE; therefore we preselect a number of
predictions used with each prover. Similarly, the strategies
that the ATP provers implement are often tailored for best
performance on the TPTP library, for the annual CASC
competition. For ITP-originating problems, especially
for E-prover, different strategies are often better, so we run
it under the alternate scheduler Epar.

3.2 Accessible facts 

As HOL(y)Hammer was initially designed for HOL Light,
it treats accessible facts in the same way as the accessibility
relation defined there: any fact that is present in a theory
loaded chronologically before the current one is available. In
HOL4 there are explicit theory dependencies, and as such
a different accessibility relation is more natural. The facts
present in the same theory before the current one, and all
the facts in the theories that the current one depends on
(possibly in a transitive way) are accessible. In this
subsection we discuss the four different accessible sets of lemmas
on which we test the performance of HOL(y)Hammer.

Exact dependencies (reproving) They are the closest
named ancestors of a theorem in the proof tree. This setting
tests how much HOL(y)Hammer could reprove if it had perfect
predictions. In this setting no relevance filtering is done, as
the number of dependencies is small.

Transitive dependencies They are all the named ancestors
of a theorem in the proof tree. This setting simulates proving a
theorem in a perfect environment, where all recorded
theorems are a necessary step to prove the conjecture. This
corresponds to a proof assistant library that has been
refactored into little theories.

Loaded theorems All theorems present in the loaded
theories are provided together with all the theorems previously
built in the current theory. This is the setting used when
proving theorems in HOL4, so it is the one we use in our
interactive version presented and evaluated in Section 4.

Linear order For this experiment, we additionally recorded
the order in which the HOL4 theories were built, so that we
could order all the theorems of the standard library in a
similar way as HOL Light theorems are ordered. All previously
derived theorems are provided.
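The difference between the loaded theorems and linear order settings can be made concrete with a small sketch. The data layout is an assumption for illustration: order is the list of all theorems in the order they were built, theory_of maps a theorem to its theory, and parents maps a theory to the theories it directly depends on. None of these names come from HOL(y)Hammer.

```python
def ancestors(theory, parents):
    """All theories the given theory depends on, transitively."""
    seen = set()
    stack = list(parents.get(theory, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def loaded_theorems(goal, theory_of, parents, order):
    """Theorems of ancestor theories plus earlier ones of the same theory."""
    thy = theory_of[goal]
    anc = ancestors(thy, parents)
    pos = order.index(goal)
    return [t for i, t in enumerate(order)
            if theory_of[t] in anc or (theory_of[t] == thy and i < pos)]


def linear_order(goal, order):
    """Every theorem derived before the goal in the global build order."""
    return order[:order.index(goal)]


# Toy library: arith and list both depend on bool, but not on each other.
parents = {"bool": [], "arith": ["bool"], "list": ["bool"]}
theory_of = {"T1": "bool", "A1": "arith", "L1": "list", "L2": "list"}
order = ["T1", "A1", "L1", "L2"]

loaded = loaded_theorems("L2", theory_of, parents, order)
linear = linear_order("L2", order)
```

For the goal L2, the arith theorem A1 is accessible only in the linear order setting, because the list theory does not depend on arith.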

3.3 Features 

Machine learning algorithms typically use features to define
the similarity of objects. In the large theory automated
reasoning setting, features need to be assigned to each theorem,
based on the syntactic and semantic properties of the
statement of the theorem and its attributes.

HOL(y)Hammer represents features by strings and
characterizes theorems using lists of strings. Features originate
from the names of the type constructors, type variables,
names of constants and printed subterms present in the
conclusion. An important notion is the normalization of the
features: for subterms, their variables and type variables need
to be normalized. Various scenarios for this can be
considered:

• All variables are replaced by one common variable. 

• Variables are replaced by their de Bruijn index
numbers [13].

• Variables are replaced by their (variable-normalized)
types.

The union of the features coming from the three above 
normalizations has been the most successful in the HOL 
Light experiments, and it is used here as well. 
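To make the notion of normalized subterm features concrete, the following sketch extracts features from a toy term. It is a simplification under stated assumptions: terms are tuples ("const", name), ("var", name, type) and ("app", function, argument), types are plain strings, only the common-variable and type normalizations are shown (the de Bruijn variant and type constructor features are omitted), and the function names are hypothetical.

```python
def show(term, norm):
    """Print a term with its variables normalized.

    norm == "common": every variable becomes X.
    norm == "type":   every variable becomes its type.
    """
    kind = term[0]
    if kind == "var":
        _, _name, ty = term
        return "X" if norm == "common" else ty
    if kind == "const":
        return term[1]
    _, f, x = term
    return f"({show(f, norm)} {show(x, norm)})"


def subterms(term):
    """Enumerate a term and all of its subterms."""
    yield term
    if term[0] == "app":
        yield from subterms(term[1])
        yield from subterms(term[2])


def features(term):
    """Constant names plus printed applications under both normalizations."""
    feats = set()
    for sub in subterms(term):
        if sub[0] == "const":
            feats.add(sub[1])
        elif sub[0] == "app":
            feats.add(show(sub, "common"))
            feats.add(show(sub, "type"))
    return sorted(feats)


def hd_cons(v1, v2):
    """The term HD (CONS v1 v2) with an int and a list int variable."""
    cons = ("app", ("app", ("const", "CONS"), ("var", v1, "int")),
            ("var", v2, "list int"))
    return ("app", ("const", "HD"), cons)


feats = features(hd_cons("n", "t"))
```

Because variable names are erased by both normalizations, hd_cons("n", "t") and hd_cons("x", "y") yield the same feature list, which is the point of normalizing.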

3.4 Predictors 

In all our experiments we have used the modified k-NN
algorithm. This algorithm produced the most precise results
in the HOL(y)Hammer experiments for HOL Light. Given
a fixed number k, the k-nearest neighbours learning
algorithm finds the k premises that are closest to the conjecture,
and uses their weighted dependencies to find the predicted
relevance of all available facts. All the facts and the
conjecture are interpreted as vectors in the n-dimensional feature
space, where n is the number of all features. The distance
between a fact and the conjecture is computed using the
Euclidean distance. In order to find the neighbours of the
conjecture efficiently, we store an association list mapping
features to theorems that have those features. This allows
skipping completely the theorems that have no features in
common with the conjecture.

Having found the neighbours, the relevance of each available
fact is computed by summing the weights of the neighbours
that use the fact as a dependency, counting each neighbour
also as its own dependency.
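The following is a plain weighted k-NN sketch of the procedure described in this subsection, not the modified algorithm used in the experiments. Two points are assumptions: features are treated as binary, so the Euclidean distance is the square root of the size of the symmetric difference of the feature sets, and the weight of a neighbour is taken to be 1 / (1 + distance), since the text does not give the weighting. The function name knn_relevance is hypothetical.

```python
import math
from collections import defaultdict


def knn_relevance(goal_feats, facts, deps, k):
    """Rank facts by predicted relevance for a conjecture.

    goal_feats: set of features of the conjecture
    facts:      fact name -> set of features
    deps:       fact name -> set of fact names used in its proof
    """
    # Inverted index from features to the facts that have them.
    index = defaultdict(set)
    for name, feats in facts.items():
        for f in feats:
            index[f].add(name)

    # Only facts sharing at least one feature can be neighbours.
    candidates = set()
    for f in goal_feats:
        candidates |= index.get(f, set())

    def dist(name):
        # Euclidean distance between binary feature vectors.
        return math.sqrt(len(goal_feats ^ facts[name]))

    neighbours = sorted(candidates, key=lambda n: (dist(n), n))[:k]

    relevance = defaultdict(float)
    for n in neighbours:
        weight = 1.0 / (1.0 + dist(n))  # assumed weighting
        # Each neighbour also counts as its own dependency.
        for fact in deps.get(n, set()) | {n}:
            relevance[fact] += weight
    return sorted(relevance, key=lambda f: (-relevance[f], f))


facts = {
    "ADD_0": {"+", "0"},
    "ADD_SUC": {"+", "SUC"},
    "MUL_1": {"*", "1"},
    "HD_CONS": {"HD", "CONS"},
}
deps = {"ADD_SUC": {"ADD_0"}, "MUL_1": {"ADD_0"}}
ranking = knn_relevance({"+", "SUC", "0"}, facts, deps, k=2)
```

In this toy run ADD_0 is ranked first because it is both a neighbour and a dependency of the other neighbour, while MUL_1 and HD_CONS are never examined since they share no feature with the conjecture.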

4. Experiments 

In this section, we present the results of several experiments
and discuss the quality of the advice system based on these
results. The hardware used during the reproving and
accessibility experiments is a 48-core server (AMD Opteron 6174
2.2 GHz CPUs, 320 GB RAM, and 0.5 MB L2 cache per
CPU). In these experiments, each ATP is run on a single
core for each problem with a time limit of 30 seconds. The
reconstruction and interactive experiments were run on a
laptop with an Intel Core processor (i5-3230M 4 x 2.60GHz
with 3.6 GB RAM).

4.1 Reproving 

We first try to reprove all the 9434 theorems in the HOL4
libraries with the dependencies extracted from the proofs.
This number is lower than the number of exported
theorems because definitions are discarded. Table 2 presents the
success rates for reproving using the dependencies recorded
without splitting. In this experiment we also compare many
provers and their versions. For E-prover, we also compare
its different scheduling strategies. The results are used
to choose the best versions or strategies for the selected few
provers. Apart from the success rates, the unique number of
problems is presented (proofs found by this ATP only), and
CVC4 seems to perform best in this respect. The
translation used by default by HOL(y)Hammer is an incomplete
one (it gives significantly better results than complete ones),
so some of the problems are counter-satisfiable.

From this point on, experiments will be performed only
with the best versions of three provers: E-prover,
Vampire, and z3. They have a high success rate combined
with an easy way of retrieving the unsatisfiable core. The
same ones have been used in the HOL(y)Hammer
experiments for HOL Light.

In Table 3 we try to reprove conjuncts of these theorems
with the different recording methods described in
Section 2.3. First, we notice that only z3 benefits from the
tracking of more accurate dependencies. Moreover, removing the
unnecessary conjuncts worsens the results of E-prover and
Vampire. One reason is that E-prover and Vampire do well with
a large number of lemmas, and although a conjunct was not
used in the original proof it may well be useful to these
provers. Surprisingly, the percentage of reproved facts did not
increase compared to Table 2, as was the case in the HOL
Light experiments. By looking closely at the data, we notice
the presence of the quantHeuristics theory, where 85
theorems are divided into 1538 conjuncts. As the percentage of
reproving in this theory is lower than the average (16%), the
overall percentage gets smaller given the increased weight of
this theory. Therefore, we have removed the quantHeuristics
theory in the Basic* and Optimized* experiments for a
fairer comparison with the previous experiments. Finally, if
we compare the Optimized experiment with the similar HOL
Light reproving experiment on 14185 Flyspeck problems,
we notice that we can reprove three percent more theorems
in HOL4. This is mostly due to a 10 percent increase in the
performance of z3 on HOL4 problems.
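The effect of the quantHeuristics theory on the overall percentage can be checked with a short calculation. The sketch below assumes that the 16% figure is the (rounded) total reproving rate of the Optimized setting inside that theory, and it uses the total rate of 46.76% on 13910 conjuncts reported in Table 3; the function name rate_without is hypothetical.

```python
def rate_without(total_rate, total_n, part_rate, part_n):
    """Success rate on the remaining problems once a sub-collection
    with its own success rate is removed from the total."""
    solved_total = total_rate * total_n
    solved_part = part_rate * part_n
    return (solved_total - solved_part) / (total_n - part_n)


# Optimized total of Table 3, minus 1538 conjuncts reproved at about 16%.
rest = rate_without(0.4676, 13910, 0.16, 1538)
```

Under these assumptions the remaining 12372 conjuncts are reproved at a rate within a fraction of a percent of the 50.55% listed for Optimized* in Table 3, which illustrates how a single large theory with a low success rate lowers the overall figure.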

In Table 4 we have compared the success rates of reproving
in different theories, as this may represent the relative
difficulty of each theory and also the relative performance
of each prover. We observe that z3 performs best on the
theories measure and probability, list and finite_map,
whereas E-prover and Vampire have a higher success rate on
the theories arithmetic, real, complex and sort. Overall,
the high success rate in the arithmetic and real theories
confirms that HOL(y)Hammer can already tackle this type of
theorems. Nonetheless, it would still benefit from integrating
more SMT solver functionality on advanced theories
based on real and arithmetic.

4.2 With different accessible sets 

In Table 5 we compare the quality of the predictions in different
proving environments. We recall that only the transitive
dependencies, loaded theories and linear order settings
use predictions and that the number of these predictions
is adapted to the ability of each prover. The exact
dependencies setting (reproving) is copied from Table 3 for easier
comparison.


Prover     Version   Theorem (%)   Unique   CounterSat
E-prover   Epar 3    44.45         3        0
E-prover   Epar 1    44.15         9        0
E-prover   Epar 2    43.95         9        0
E-prover   Epar 0    43.52         2        0
CVC4       1.3       42.71         44       0
z3         4.32      41.96         8        5
z3         4.40      41.65         1        6
E-prover   1.8       41.37         14       0
Vampire    2.6       41.10         14       0
Vampire    1.8       38.34         6        0
z3         4.40q     35.19         11       5
Vampire    3.0       34.82         0        0
Spass      3.5       31.67         0        0
Metis      2.3       29.98         0        0
iProver    1.0       25.52         2        35
total                50.96                  38

Table 2. Reproving experiment on the 9434 unsplit theorems
of the standard library



           Basic   Optimized   Basic*   Optimized*
E-prover   42.43   42.41       46.23    45.91
Vampire    39.79   39.32       43.24    42.41
z3         39.59   40.63       43.78    44.18
total      46.74   46.76       50.97    50.55

Table 3. Success rates of reproving (%) on the 13910
conjuncts of the standard library with different dependency
tracking mechanisms.

We first notice the lower success rate in the transitive 
dependencies setting. There are two possible explanations. First, 
the transitive dependencies provide a poor training set for 
the predictors: the set of samples is quite small and the 
available lemmas are all related to the conjecture. Second, 
it is very unlikely that a lemma in this set will be better 
than a lemma in the exact dependencies, so we cannot hope 
to perform better than in the reproving experiment. 
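
To make the transitive dependencies setting concrete, the following is a minimal Python sketch of how such an accessible set could be computed from recorded direct dependencies. It is only an illustration under stated assumptions: the function name `transitive_dependencies`, the dictionary representation, and the toy theorem names are hypothetical and are not taken from the HOL(y)Hammer code.

```python
def transitive_dependencies(theorem, direct_deps):
    """Return every theorem reachable from `theorem` via `direct_deps`.

    `direct_deps` maps a theorem name to the names of the theorems used
    directly in its proof (a hypothetical representation).
    """
    seen = set()
    stack = list(direct_deps.get(theorem, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited through another path
        seen.add(current)
        stack.extend(direct_deps.get(current, ()))
    return seen


# Hypothetical toy dependency graph; the names are illustrative only.
toy_deps = {
    "C": ["B1", "B2"],
    "B1": ["A"],
    "B2": ["A"],
    "A": [],
}

print(sorted(transitive_dependencies("C", toy_deps)))
```

In this toy graph every accessible lemma is, by construction, an ancestor of the conjecture, which illustrates why the available lemmas in this setting are all related to the conjecture.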

We now focus on the loaded theories and linear order 
settings, which are the two scenarios that correspond to the 
regular usage of a “hammer” system in a development: given 
all the previously known facts, try to prove the conjecture. 
The results are surprisingly better than in the reproving 
experiment. First, this indicates that the training data coming 
from a larger sample is better. Second, this shows that the 
HOL4 library is dense and that closer dependencies than the 
exact ones may be found by the predictors. It is quite common 
that large-theory automated reasoning techniques find 
alternative proofs. Third, if we look at each ATP separately, 
we see an increase of about one percentage point for E-prover, 
a decrease of about one point for Vampire, and a decrease of 
about nine points for z3. This correlates with the number of 
selected premises. Indeed, a prover that performs well with a 
large number of selected premises has a better chance of 
finding the relevant lemmas among them. Finally, we see that 
each of the provers contributed to the total by solving 
different problems. 
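
The per-prover differences quoted above can be recomputed from the figures of Table 5. The short Python sketch below does this arithmetic; the dictionary `table5` simply transcribes the table, and the helper name `delta` is ours.

```python
# Success rates (%) transcribed from Table 5:
# ED = exact dependencies, TD = transitive dependencies,
# LT = loaded theories, LO = linear order.
table5 = {
    "E-prover": {"ED": 42.41, "TD": 33.10, "LT": 43.58, "LO": 43.64},
    "Vampire": {"ED": 39.32, "TD": 29.56, "LT": 38.46, "LO": 38.54},
    "z3": {"ED": 40.63, "TD": 24.66, "LT": 31.22, "LO": 31.20},
    "total": {"ED": 46.76, "TD": 37.54, "LT": 50.54, "LO": 50.68},
}


def delta(prover, setting, baseline="ED"):
    """Difference in percentage points between `setting` and `baseline`."""
    return round(table5[prover][setting] - table5[prover][baseline], 2)


for prover in table5:
    print(prover, delta(prover, "LT"), delta(prover, "LO"))
```

The differences for the loaded theories setting come out at roughly +1.2 points for E-prover, -0.9 for Vampire, and -9.4 for z3, while the combined total gains almost four points.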

We can summarize the results by inferring that predictors 
combined with ATPs are most effective in large and dense 
developments. 

The linear order experiment was also designed to make 
a valid comparison with a similar experiment where 39% of 
Flyspeck theorems were proved by combining 14 methods. 
This number was later raised to 47% by improving the 
machine learning algorithm. Comparatively, the current 3 



           arith   real    compl   meas
E-prover   61.29   72.97   91.22   27.01
Vampire    59.74   69.57   77.19   20.85
z3         51.42   64.46   86.84   31.27
total      63.63   75.31   92.10   32.70

           proba   list    sort    f_map
E-prover   42.16   23.56   34.54   33.07
Vampire    37.34   21.96   32.72   27.16
z3         54.21   25.62   25.45   43.70
total      55.42   26.77   40.00   45.27



Table 4. Percentage (%) of reproved theorems in the theories 
arithmetic, real, complex, measure, probability, 
list, sorting and finite_map. 






















           ED      TD      LT      LO
E-prover   42.41   33.10   43.58   43.64
Vampire    39.32   29.56   38.46   38.54
z3         40.63   24.66   31.22   31.20
total      46.76   37.54   50.54   50.68


Table 5. Percentage (%) of proofs found using different 
accessible sets: exact dependencies (ED), transitive dependencies 
(TD), loaded theories (LT), and linear order (LO) 

methods can prove 50% of the HOL4 theorems. This may 
be because the machine learning methods have improved, because 
the ATPs are now stronger, or because the Flyspeck 
theories contain a more linear (less dense) development than 
the HOL4 libraries, which makes it harder for automated 
reasoning techniques. 

4.3 Reconstruction 

Until now, all the ATP-proved theorems could only be used 
as oracles inside HOL4. This defeats the main aim of the 
ITP, which is to guarantee the soundness of the proofs. The 
provers that we use in the experiments can return the 
unsatisfiable core: a small set of premises used during the proof. 
The HOL representation of these facts can be given to Metis 
in order to reprove the theorem with soundness guaranteed 
by its construction. We investigate reconstructing proofs 
found by Vampire in the loaded theories experiment (the setting 
used in our interactive version of HOL(y)Hammer). We found that 
Metis could reprove, with a one-second time limit, 95.6% of 
these theorems. This result is encouraging for two reasons. 
First, we have not shown the soundness of our transformations, 
and this shows that the found premises indeed lead to 
a valid proof in HOL. Second, the high reconstruction rate 
suggests that the system can be useful in practice. 
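
As an illustration of the reconstruction step, the following Python sketch turns the names in an unsatisfiable core into the text of a Metis call. It is a hedged sketch, not the actual HOL(y)Hammer code: the helper `metis_call` is hypothetical, the lemma names are used purely as sample input, and in a real session the names may need to be qualified by their theory.

```python
def metis_call(unsat_core):
    """Format the lemma names of an unsatisfiable core as a METIS_TAC call.

    Duplicates are dropped while the original order is preserved.
    """
    lemmas = list(dict.fromkeys(unsat_core))
    return "METIS_TAC [" + ", ".join(lemmas) + "]"


# Sample input only: names borrowed from the gcd theory for illustration.
core = [
    "LCM_IS_LEAST_COMMON_MULTIPLE",
    "DIVIDES_LE",
    "LCM_IS_LEAST_COMMON_MULTIPLE",
]
print(metis_call(core))
```

The resulting string is only advice: soundness comes from Metis replaying the proof inside HOL4, not from the external prover.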

4.4 Case study 

Finally, we present two sets of lemmas found by E-prover 
advised on the loaded libraries. We discuss the difference 
with the lemmas used in the original proof. 

The theorem EULER_FORMULE states that any complex 
number can be represented as a combination of its norm 
and argument. In the human-written proof script, ten theorems 
are provided to a rewriting tactic. The user is mostly 
hindered by the fact that she cannot use the commutativity 
of multiplication, as the tactic would not terminate. 
Free of these constraints, the advice system returns only 
three lemmas: the commutativity of multiplication, the polar 
representation COMPLEX_TRIANGLE, and Euler’s formula 
EXP_IMAGINARY. 

Example 10. (In theory complex) 

Original proof: 

val EULER_FORMULE = store_thm("EULER_FORMULE", 
  ``!z:complex. modu z * exp (i * arg z) = z``, 
  REWRITE_TAC[complex_exp, i, complex_scalar_rmul, 
    RE, IM, REAL_MUL_LZERO, REAL_MUL_LID, EXP_0, 
    COMPLEX_SCALAR_LMUL_ONE, COMPLEX_TRIANGLE]); 

Discovered lemmas: 

COMPLEX_SCALAR_MUL_COMM COMPLEX_TRIANGLE 
EXP_IMAGINARY 

The theorem LCM_LEAST states that any number below the 
least common multiple is not a common multiple. This seems 
trivial, but the least common multiple (lcm) of two 
natural numbers is actually defined as their product divided by 
their greatest common divisor. The user has proved the 
contraposition, which requires two Metis calls. The discovered lemmas 
seem to indicate a similar proof, with two differences. It requires 
more lemmas, namely FALSITY and IMP_F_EQ_F, because the false 
constant is treated like any other constant in HOL(y)Hammer. It also 
uses the combination of LCM_COMM and NOT_LT_DIVIDES instead of 
DIVIDES_LE. 

Example 11. (In theory gcd) 

Original proof: 

val LCM_LEAST = store_thm("LCM_LEAST", 
  ``0 < m /\ 0 < n ==> !p. 0 < p /\ p < lcm m n 
    ==> ~(divides m p) \/ ~(divides n p)``, 
  REPEAT STRIP_TAC THEN SPOSE_NOT_THEN 
  STRIP_ASSUME_TAC THEN `divides (lcm m n) p` 
  by METIS_TAC [LCM_IS_LEAST_COMMON_MULTIPLE] 
  THEN `lcm m n <= p` by METIS_TAC [DIVIDES_LE] 
  THEN DECIDE_TAC); 

Discovered lemmas: 

LCM_IS_LEAST_COMMON_MULTIPLE LCM_COMM 
NOT_LT_DIVIDES FALSITY IMP_F_EQ_F 

4.5 Interactive version 

In our previous experiments, all the different steps (export, 
learning/predictions, translation, ATPs) were performed 
separately, and simultaneously for all the theorems. Here, we 
compose all these steps to produce one HOL4 step that, given 
a conjecture, proves it, and that is usable in any HOL4 
development in an interactive advice loop. It proceeds as follows. 
The conjecture is exported along with the currently loaded theories. 
Features for the theorems and the conjecture are computed, 
and dependencies are used for learning and selecting the 
theorems relevant to the conjecture. HOL(y)Hammer translates 
the problem to the formats of the ATPs and uses them 
to prove the resulting problems. If successful, the discovered 
unsatisfiable core, consisting of the HOL4 theorems used in 
the ATP proof, is then read back to HOL4, returned as a 
proof advice, and replayed by Metis. 
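
The control flow of this advice loop can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the HOL(y)Hammer implementation: `advice_loop` and its two callable parameters `select_premises` and `run_atp` are hypothetical stand-ins for the learning/prediction step and for the translation plus ATP step, and the two stub functions exist only to make the example executable.

```python
from typing import Callable, List, Optional, Sequence


def advice_loop(
    conjecture: str,
    loaded_theorems: Sequence[str],
    select_premises: Callable[[str, Sequence[str]], Sequence[str]],
    run_atp: Callable[[str, Sequence[str]], Optional[Sequence[str]]],
) -> Optional[List[str]]:
    """Return the advised lemmas (the unsatisfiable core), or None on failure."""
    # Learning and prediction: keep only the theorems judged relevant.
    premises = select_premises(conjecture, loaded_theorems)
    # Translation and ATP call: the prover reports the premises it used.
    core = run_atp(conjecture, premises)
    if core is None:
        return None  # no proof found, hence no advice
    return list(core)


# Stubs used only for demonstration; real steps would call external tools.
def first_two(conjecture, theorems):
    return list(theorems)[:2]


def fake_atp(conjecture, premises):
    used = [p for p in premises if p.startswith("GCD")]
    return used or None


print(advice_loop("goal", ["GCD_A", "OTHER", "GCD_B"], first_two, fake_atp))
```

Passing the steps as parameters mirrors the fact that each phase is a separate component; the advised lemmas would then be replayed by Metis inside HOL4.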

In the last experiment, we evaluate the time taken 
by each step on two conjectures that are not already 
proved in the HOL4 libraries. The first tested goal C1 is 
gcd (gcd a a) (b + a) = gcd b a, where gcd n m is the 
greatest common divisor of n and m. It can be automatically 
proved from three lemmas about gcd. The second goal 
C2 is Im (i * i) = 0, where Im is the imaginary part of a 
complex number. It can be automatically proved from 12 
lemmas in the theories real, transc and complex. 

In Table 6, the time taken by the export and import 
phases depends linearly on the number of theorems in the 
loaded libraries (given in parentheses), as expected from our 
knowledge of the data and the complexity analysis of our 
code. 

The time shown in the fourth column (“Predict”) includes 
the time to extract features, to learn from the dependencies, 
and to find 96 relevant theorems. The time needed for 
machine learning is relatively short. The time taken by 
Vampire shows that the second conjecture is harder. This 
is backed by the fact that we could not tell in advance which 
lemmas would be necessary to prove this conjecture. The 
last column (“Total”) presents the time between the interactive 
call and the display of the advised lemmas. The low running times 
support the claim that our tool is fast enough for interactive 
use. 



            Export   Import   Predict   Vampire   Total
C1 (2224)   0.38     0.20     0.29      0.01      0.97
C2 b056)    0.67     0.43     0.59      1.58      3.42



Table 6. Time (in seconds) taken by each step of the advice 
loop 


5. Conclusion 

In this paper we present an adaptation of the HOL(y)Hammer 
system for HOL4, which allows for general purpose 
learning-assisted automated reasoning. As HOL(y)Hammer uses 
machine learning for relevance filtering, we need to compute 
the dependencies, define the accessibility relation for 
theorems, and adapt the feature extraction mechanism to HOL4. 
Further, as we export all the proof assistant data (types, 
constants, named theorems) to a common format, we define 
the namespaces to cover both HOL Light and HOL4. 

We have evaluated the resulting system on the top-level 
goals of the HOL4 standard library: for about 50% of them a 
sufficient set of dependencies can be found automatically. 
We compare the success rates depending on the accessibility 
relation and on the treatment of theorems whose statements 
are conjunctions. We provide a HOL4 command that 
translates the current goal, runs premise selection and the 
ATP, and, if a proof has been found, returns a Metis call 
needed to solve the goal. The resulting system is available 
at https://github.com/barakeel/HOL 

5.1 Future Work 

The libraries of HOL Light and HOL4 are currently processed 
completely independently. We have, however, made sure that 
all data is exported in the same format, so that the same 
concepts and theorems about them can be discovered 
automatically. By combining the data, one might get goals in 
one system solved with the help of theorems from the other, 
which can then be turned into lemmas in the new system. 
A first challenge might be to define a combined accessibility 
relation in order to evaluate such a combined proof assistant 
library. 

The format that we use for the interchange of HOL4 and 
HOL Light data is heavily influenced by the TPTP formats 
for monomorphic higher-order logic and polymorphic 
first-order logic. It is, however, slightly different from that 
used by Sledgehammer’s fullthf. By completely standardizing 
the format, it would be possible to interchange problems 
between Sledgehammer and HOL(y)Hammer. 

In HOL4, theorems include the information about the 
theory they originate from and other attributes. It would 
be interesting to evaluate the impact of such additional 
attributes used as features for machine learning on the 
success rate of the proofs. Finally, most HOL(y)Hammer 
users call its web interface rather than locally install 
the necessary prover modifications, proof translation and 
the ATP provers. It would be natural to extend the web 
interface to support HOL4. 